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Abstract — The discovery of motifs underlying gene expression 
is a challenging one. Some of these motifs are known transcription 
factors, but sequence inspection often provides valuable clues, 
even discovery of novel motifs with uncharacterized function in 
gene expression. Coupled with the complexity underlying tissue- 
specific gene expression, there are several motifs that are puta- 
tively responsible for expression in a certain cell type. This has 
important implications in understanding fundamental biological 
processes, such as development and disease progression. In this 
work, we present an approach to the principled selection of 
motifs (not necessarily transcription factor sites) and examine 
its application to several questions in current bioinformatics 
research. 

There are two main contributions of this work: Firstly, we 
introduce a new metric for variable selection during classification 
, and secondly, we investigate a problem of finding specific 
sequence motifs that underlie tissue specific gene expression. In 
conjunction with the SVM classifier we find these motifs and 
discover several novel motifs which have not yet been attributed 
with any particular functional role (eg: TFBS binding motifs). We 
hypothesize that the discovery of these motifs would enable the 
large-scale investigation for the tissue specific regulatory potential 
of any conserved sequence element identified from genome-wide 
studies. 

Finally, we propose the utility of this developed framework 
to not only aid discovery of discriminatory motifs, but also to 
examine the role of any motif of choice in co-regulation or co- 
expression of gene groups. 

Index Terms — Nephrogenesis, Directed Information, Tran- 
scriptional regulation, phylogeny, protein-protein interaction. 
Transcription factor Binding sites (TFBS), GATA genes, T-cell 
activation, comparative genomics, tissue specific genes. 



I. Introduction 

Understanding the mechanisms underlying regulation of 
tissue specific gene expression is still a challenging question 
that does not have a satisfactory explanation. While all mature 
cells in the body have a complete copy of the human genome, 
each cell type only expresses those genes it needs to carry 
out its assigned task. This includes genes required for basic 
cellular maintenance (often called "house keeping genes") 
and those genes whose function is specific to the particular 
tissue type the cell belongs to. Gene expression by way of 
transcription is the process of generation of messenger RNA 
(mRNA) from the DNA template representing the gene. It 
is the intermediate step before the generation of functional 
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Fig. 1 . Schematic of Transcriptional Regulation. 



protein from messenger RNA. During gene expression (Fig. 
[T]i, transcription factor proteins are recruited at the proximal 
promoter of the gene as well as at sequence elements (en- 
hancers/silencers) which can lie several hundreds of kilobases 
from the gene's transcriptional start site. 

The combination of the basal transcriptional machinery at 
the promoter coupled with the transcription factor complexes 
at the distal regulatory elements are collectively involved in 
directing tissue specific expression of genes. Some of the 
common features of these distal regulatory elements are: 

• Non-coding elements: Distal regulatory elements are non- 
coding and can eitehr be intronic or intergenic region 
on the genome. Hence previous models of gene finding 
are not applicable herein. Moreover, with over 98% 
of the annotated genome being non-coding the precise 
localization of regulatory elements that underlie tissue- 
specific gene expression is a challenging problem. 

• Distance/orientation independent: An enhancer can act 
from great genomic distances (hundreds of kilobases) to 
regulate gene expression in conjunction with the proximal 
promoter, possibly via a looping mechanism. They can lie 
upstream or downstream of the actual gene. 

• Promoter dependent: Since the action at a distance of 
these elements involves the recruitment of transcription 
factors that direct tissue-specific gene expression, the 
promoter that they interact with is critical. 

Though there are instances where a gene harbors tissue 
specific activity at their promoter itself, there is considerable 
interest to examine the long range elements (LREs) for a 
detailed understanding of their regulatory role in gene expres- 
sion during biological processes like organ development and 
disease progression [39]. An in-silico examination of the se- 
quence properties of such LREs would result in computational 
strategies to find novel LREs genomewide that govern tissue 
specific expression for any gene of interest. 



2 



Our primary question in this regard is: are there any 
discriminating sequence properties of such LRE elements that 
determine tissue specific gene expression - more particularly, 
are there any sequence motifs in such regulatory elements 
that can aid discovery of new elements with similar potential 
for tissue-specific regulation genomewide [22]. We remind 
the reader that a similar approach was used in gene finding 
algorithms, wherein sequence features of exons are examined 
to facilitate the discovery of novel genes. In popular gene find- 
ing algorithms [21], discriminatory motifs are identified from 
annotated genes and non-coding regions. Such an approach can 
be extended to this situation too wherein we look for motifs 
discriminating regulatory and neutral non-coding elements. 
In this work, we explore the possible existence of such 
discriminating sequence motifs in two kinds of biologically 
relevant regulatory sequences: 

• Promoters of tissue specific genes: Before the 
widespread discovery of long-range regulatory elements 
(LREs) it was hypothesized that promoters governed 
gene expression entirely. There is substantial evidence 
for the binding of tissue-specific transcription factors at 
the promoters of expressed genes. This suggests that, in 
spite of newer information impUcating the role of LREs, 
promoters might also have interesting motifs that govern 
tissue-specific expression, that are potentially relevant to 
the discovery of new LREs de-novo. 

Another practical reason for the examination of pro- 
moters is that their locations (and genomic se- 
quences) are more unambiguously delineated on genome 
databases (like UCSC or Ensembl). Sufficient data 
(http://symatlas.gnf.org) on the expression of genes is 
publicly available for analysis. 

We set up the motif discovery as a feature extraction 
problem from these tissue-specific promoter sequences 
and then build a support vector machine (SVM) classifier 
to classify new promoters into specific and non-specific 
categories based on the identified sequence features (mo- 
tifs). Using the SVM classifier algorithm we are able to 
accurately classify 70% of tissue specific genes based 
upon their upstream promoter region sequences alone. 

• Known long range regulatory elements (LRE) motifs: 
To analyze the motifs in LRE elements, there is a need 
for an experimental dataset of elements that confer 
tissue-specific expression in some eukaryotic animal 
model. For our purpose, we examine the results of this 
approach on other new data source of interest - the 
Enhancer Browser at http://enhancer.lbl.gov which has 
results of expression of ultraconserved genome elements 
in transgenic mice [23]. An examination of these 
ultraconserved enhancers is useful for the extraction 
of discriminatory motifs to distinguish these regulatory 
elements from non-regulatory (neutral) ones. Here the 
results indicate that upto 90% of the sequences can be 
correctly classified using these identified motifs. 

We note that some of the identified motifs might not be 
transcription factor binding motifs, and would need to be 
functionally characterized. This is an advantage of our method 



- instead of constraining ourselves by the degeneracy present 
in TF databases (like TRANSFAC/JASPAR), we look for all 
sequences of a fixed length. 

II. Contributions 

From microarray expression data, ([11],[12]) proposes an 
approach to assign genes into tissue specific and non-specific 
types using an entropy criterion. They use the variation in 
expression and its divergence from ubiquitous expression 
(uniform distribution across all tissue types) to make this 
assignment. From this assignment, several features like CpG 
island density, frequency of transcription factor motif occur- 
rence, can be examined to discriminate these two subgroups. 
However, a denovo examination of 'every' sequence feature 
in sequence and its subsequent interpretation has not been 
pursued adequately. Other work has explored the existence of 
key motifs (transcription factor binding sites) in the promoters 
of tissue-specific genes [25]. 

For the purpose of identifying discriminative motifs from 
the training data (tissue-specific promoters or LREs), our 
approach is two-pronged: 

• Variable selection: We first find those sequence motifs 
that can discriminate between tissue-specific and non- 
specific elements. In machine learning, this is a feature 
selection problem. Here, these features are the counts of 
sequence motifs in these training sequences. Without loss 
of generality, we can use six-nucleotide motifs (hexam- 
ers) as the motif features. This is based on the observation 
that most transcription factor binding motifs have a 5- 
6 nucleotide core sequence with degeneracy at the ends 
of the motif. Another reason is that we need to find a 
length that captures biologically meaningful information 
as well as does not lead to an unduly large search space. 
Even though our motivation for choosing hexamers was 
independent, a similar setup has been referred to in ([2], 
[5]). We find that 4^ possibilities (from hexamer se- 
quences) yields good performance without being unduly 
costly, computationally. The overall presented approach, 
however, does not depend of this choice of motif length 
and can be scaled depending on biological intuition. 

In the first part of the work in this paper, we present 
a novel feature selection approach (based on a new 
information theoretic quantity called directed information 
- DTI) that can be applied to such learning problems. 

• Classifier design: After discovering key discriminating 
motifs from the above DTI step, we proceed to build 
a SVM classifier that separates the samples between the 
two classes (specific and non-specific) from this feature 
space. One distinction we make is that the class label 
is a perceptual abstraction (based on our intuition and 
experience). The feature space (hexamer counts) is a 
physical/measurement space - this is what we can look 
for. 

Apart from the novel feature selection approach, our con- 
tributions to bioinformatics methodology are outlined below: 
From the identified motifs, we ask several related questions:. 
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• Are there any common motifs identified from tissue- 
specific promoters and corresponding enhancers, both 
underlying expression in the same tissue?. To answer this, 
we examine brain-specific promoters and enhancers. 

• Do these motifs correlate with known motifs (transcrip- 
tion factor binding sites), 

• how useful are these motifs in predicting new regulatory 
elements?. 

This work differs from that in ([2], [5]), in several aspects. 
We present the novel DTI based feature selection procedure as 
part of an overall unified framework to answer several ques- 
tions in bioinformatics, not limited to finding discriminating 
motifs between two classes of sequences. Particularly, one of 
the advantages of the proposed approach is the examination 
of any particular motif as a potential discriminator between 
two classes. Also, this work accounts for the notion of tissue- 
specificity of promoters/enhancers (in line with more recent 
work in [41], [23], [1 1], [12], [40]). This is clarified further in 
the Results (Sections: XI and XII). After solving the main 
problem posed in this work viz, identification of tissue specific 
motifs from annotated sequences (promoter/LREs), we will 
then examine some other related problems which would benefit 
from such an analysis. 

III. Rationale 

Some of the approaches to finding motifs relevant to certain 
classes with respect to examining common motifs driving gene 
regulation is summarized in ([43], [44]). The most common 
approach is to look for TFBS motifs (TRANSFAC / JASPAR) 
that are statistically over-represented based on a binomial or 
Poisson model in the promoters of the co-expressed genes. 
This assumes a parametric form on the background density of 
distribution of motifs in promoters. 

In our situation, we set-up the problem of discriminative 
motif discovery as a word-document classification problem. 
Having constructed two groups of genes for analysis, tissue 
specific ('ts') and non-tissue specific ('nts') - we are interested 
to find hexamer motifs which are most discriminatory between 
these two classes. Our goal would be to make this set of motifs 
as small as possible - i.e. to achieve maximal class partitioning 
with the smallest feature subset. This is the classic feature- 
selection problem. 

Several metrics have been proposed that can find features 
which have maximal association with the class label. In 
information theory, mutual information is a popular choice. 
This is a symmetric association metric and does not resolve 
the direction of dependency (i.e if features depend on the class 
label or vice versa). It is important to find features that are 
induced by the class label. When we capture features relevant 
to a data sample, feature selection implies selection (control) 
of a feature subset that maximally captures the underlying 
character (class label) of the data . We have no control over 
the label (a purely perceptual characterization). 

With this motivation, we propose the use of a new metric, 
termed "Directed Information" (DTI) for finding such a dis- 
criminative hexamer subset. Subsequent to feature selection, 
we design a Support Vector Machine (SVM) classifier to 
classify sequences that are tissue-specific or not. 



As expected, the input to such an approach would be a gene 
promoter - motif frequency table (Table The genes relevant 
to each class are identified from tissue microarray analysis, 
and the frequency table is built by parsing the gene promoters 
for the presence of each of the 4^ = 4096 possible hexamers. 

Ensembl Gene ID AAAAAA AAAAAG AAAAAT AAAACA 



ENSG00000155366 1 4 

ENSGOOOOO 1780892 6 5 5 6 

ENSGOOOOOl 89171 12 10 

ENSGOOOOO 168664 6 3 8 

ENSGOOOOO 1609 17 4 1 4 2 

ENSGOOOOO 163655 2 4 1 

ENSGOOOOO 1228844 8 6 10 7 

ENSGOOOOO 176749 

ENSG00000006451 5 2 2 1 



TABLE 1 

The 'motif frequency matrix' for a set of gene-promoters. The 

FIRST column is THEIR ENSEMBL GENE IDENTIFIERS AND THE OTHER 
4 COLUMNS ARE THE MOTIFS. A CELL ENTRY DENOTES THE NUMBER OF 
TIMES A GIVEN MOTIF OCCURS IN THE UPSTREAM (-2000 TO +1000BP 
FROM TSS) REGION OF EACH CORRESPONDING GENE. 

IV. Methods 
Below we present our approach to find promoter specific or 
enhancer-specific motifs. 

A. Promoter motifs: 

1) Microarray Analysis: Raw microarray data 
was obtained from the Novartis Foundation (GNF) 
[http://symatlas.gnf.org/]. Data was normalized using RMA 
from the Bioconductor packages for R [cran.r-project.org/]. 
Following normalization, replicate samples were averaged 
together. Only 25 tissue types were used in our analysis 
including: Adrenal, Amygdala, Brain, Caudate Nucleus, 
Cerebellum, Corpus Callosum, Cortex, Dorsal Root Ganglion, 
Heart, HUVEC, Kidney, Liver, Lung, Pancreas, Pituitary, 
Placenta, Salivary, Spinal Cord, Spleen, Testis, Thalamus, 
Thymus, Thyroid, Trachea, and Uterus. 

In order to classify genes as tissue specific or not, we had 
to define what tissue-specific expression means in our context. 
The notion of tissue-specificity is fairly ambiguous. We define 
a gene as being tissue specific if it is expressed in no more than 
three tissue types. We also defined non-tissue specific genes 
as those being expressed in at least 22 of the 25 tissue types 
we examined. A binary assignment method was employed to 
determine if a gene was highly expressed in a given tissue 
type. In this method, any gene whose expression level was at 
least two-fold greater than the median expression level for the 
tissue type was considered to be highly expressed and was 
assigned a score of one. Genes not meeting this requirement 
were given an assignment of zero for that particular tissue 
type. Using this approach a single numerical summary could 
be achieved for every gene (across all tissue types). This value 
could be used to find how many genes were highly expressed 
in most tissue types and those that were only expressed in 
a few. We note that, this also allows ample flexibility to a 
biologist to examine the genes that she is interested in. 



4 



Suppose there are genes, gi, 32, • • • , ffw and T 
tissue types (in GNF: T = 25), we construct a N x T 
tissue specificity matrix : M = [0]nxt- For each gene 
gi,l<i< N, let 5ijo.5T] = median{gi^k)yk e 1,2, ...,T. 
Define, each entry Mi^k as, 

M — / if fi.fe - ^^.[O.ST]; 

^^^'■'^ ~ \ otherwise. 



Now consider the N dimensional vector rrii = 
Y^'^=i Mi,k, 1 < i < N i.e. summing all the columns of each 
row. Plot the inter-quartile range of 'm'. For the gene indices 
'i' lying in quartile 1 (=3), label as 'ts', and for indices in 
quartile 4 (= 22), label as 'nts'. 

With this approach, a total of 1924 probes representing 
1817 genes were classified as tissue specific, while 2006 
probes representing 2273 genes were classified as non-tissue 
specific. For our studies, we consider genes which are either 
heart-specific or brain- specific. From the tissue-specific genes 
obtained in the above approach, we find 55 gene promoters 
that are brain specific and 118 gene promoters that are heart- 
specific. The objective in this work is to find motifs that are 
responsible for brain/heart specific expression and possibly 
correlate them (atieast a subset) with binding profiles of known 
transcription factor binding motifs. 

2) Sequence Analysis: Genes associated with candi- 
date probes were identified using the Ensembl Ensmart 
[http://www.ensembl.org/]. For each gene, we extracted 
2000bp upstream and lOOObp down-stream upto the start of 
the first exon relative to their reported transcriptional start site 
in the Ensembl Genome Database (Release 37). The relative 
counts of each of the 4® hexamers were computed within 
each gene-promoter sequence of the two categories ('ts' and 
'nts') - using the 'seqinr' Ubrary in the R environment to 
parse these sequences and obtain the frequency of occurrence 
(counts) of each hexamer in a sequence.. A paired t-test 
was performed between the relative counts of each hexamer 
between the two expression categories ('ts' and 'nts') and 
the top 1000 significant hexamers having a p- value less than 
10~^ were selected (X = Xi, X2, . . . , Xiooo)- The relative 
counts of these hexamers was computed again for each gene 
individually. This resulted in two hexamer-gene co-occurrence 
matrices, each with Ntrain rows of genes and M = 1000 
columns of hexamers - one for the 'ts' class and the other for 
the 'nts' class. We note that Ntrain = ^in{Sts, Snts) with 
Sts being the number of positive training ('ts') samples and 
Snts being the number of negative training ('nts') samples. 
This is done to avoid bias problems during learning. 

B. LRE motifs: 

To analyze long range elements which confer tissue spe- 
cific expression, we examine the Mouse Enhancer database 
{http://enhancer.lbl.gov/). Briefly, this database has a hst of 
experimentally validated ultraconserved elements which have 
been tested for tissue specific expression in transgenic mice 



[23]. This database can be searched for a list of aU elements 
which have expression in a tissue of interest. In our case, 
we consider expression in tissues relating to the developing 
brain. We note that according to the experimental protocol, the 
various regions are cloned upstream of a heat shock protein 
promoter, not adhering to the idea of promoter specificity in 
tissue-specific expression. Though this is of concern in that 
there is loss of some gene-specific information, we work with 
this data since we are more interested in tissue expression and 
also because there is paucity of public enhancer-dependent 
data . 

This database also has a collection of ultraconserved el- 
ements that do not have any transgenic expression in-vivo. 
This is the neutral/background set of data which corresponds 
to the 'nts' (non-tissue specific class) during feature selection 
and classifier design. 

As in the above (promoter) case, we can parse these 
sequences (sixty two enhancers for brain-specific expression) 
for the absolute counts of the 4096 hexamers, build a co- 
occurrence matrix {Ntrain = 62) and then use t-test p-values 
to find file top 1000 hexamers (X' = X(, X^, . . . , X[qqq) that 
are maximally different between the two classes (brain- specific 
and brain non-specific). 

The next three sections clarify the preprocessing, feature se- 
lection and classifier design steps to mine these co-occurrence 
matrices for hexamer motifs that are strongly associated with 
the class label. We note that though we illustrate this work 
using two class labels, the method can be extended in a 
straightforward way to the multi-class problem. 

V. Preprocessing 

From the above, we now have Ntrain x 1000 co-occurrence 
matrices each for the tissue-specific and non-specific data, both 
for the promoter and enhancer sequences. Before proceeding 
to the feature (hexamer motif) selection step, we would need 
to normalize the counts of the M = 1000 hexamers in each 
training sample. For this, we can obtain an interquartile range 
of the hexamer counts in each gene, and create equivalent co- 
occurrence matrices of dimension Ntrain x 1000 where each 
entry is the quantile membership of the hexamer count. In this 
work, we use a (K = 4)-quantile label assignment. 

In this co-occurrence matrix, let gci^k represents the abso- 
lute count of the fc*'' hexamer, k G 1, 2, . . . , M in the i*'' 
gene. Then, for each gene gi, the quantile labeled matrix has 

= I if g%[i^M] ^ 9Ci,k < 9%[j^_M\ 
We can now construct matrices of dimension Ntrain x 1001 
for each of the specific and non-specific training samples. Each 
matrix would contain the quantile label assignments for the 
1000 hexamers (Xj, i e (1, 2, . . . , 1000)), as stated above, and 
the last column would have the class label (Y ~ —1/ + V). 
These two matrices are then integrated into one composite 
training data matrix of dimension 2N train x 1001. 

VI. Directed Information and Feature Selection 

The primary goal in feature selection is to find the minimal 
subset of features (from hexamers: -^^^,1:1000) that lead to max- 
imal discrimination of the class label (1^ e (— 1/ -|- 1)), using 
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each of the i G (1,2,..., 2Ntrain) genes for training. We are 
looking for a subset of the variables (X^.i, . . . , Xi.iooo) which 
are directionally associated with the class label (Yi). These 
hexamers putatively influence/induce the class label (Fig. |2]i. 
As can be seen from [http://research.ihost.com/cws2006/], 
there is considerable interest in using causality to solve this 
problem by discovering dependencies from the given data. Our 
interpretation [10], is to find features (in measurement space) 
that induce the class label (in perceptual space). 
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Fig. 3. Directed Information setup under Source coding with feedforward 



Fig. 2. Causal Feature discovery, adapted from [10]. Here the variables X\ 
and X2 discriminate Y . 



There has been a lot of previous work exploring the 
feasibility of using mutual information (MI) as a method to 
infer such conditional dependence/influence between features 
and class labels [45] by exploring the structure of the joint 
distribution of motif frequency profiles from the count matrix. 
However, this metric is undirected and does not resolve 
whether the hexamers induce the class label or vice-versa. 
This resolution is essential since one can only control the 
physical/measurement feature space, whereas the perceptual 
space (class label) remains the same. Hence, the only freedom 
we have during learning is: which features, among the ones 
that we collect, are maximally representative of the class label 
or data type. The absence of such a 'directed' information 
theoretic metric has prevented us from exploitation of the 
full potential of information theory. We thus examine the 
Directed Information (DTI) criterion as a potential metric to 
the explicit inference of feature influence. This enables us to 
uncover any meaningful relationship between features {Xi) 
and class label (Y). In a regression (state-space) framework, 
the measurements are the state variables and the class label is 
the observation. 

A brief background about DTI is in order: Directed In- 
formation comes from a rich literature regarding capacity of 
channels with feedback or understanding rate-distortion theory 
in source coding with feedforward [19]. Source coding and 
channel coding are information-theoretic duals, and DTI is 
useful to characterize source or channel behavior when the 
information being transferred is correlated [20]. 

The relationship between MI and DTI is given by, 

MI: I{X^-Y^) = E»=i-f(^";^^l^'"^) = 
Y^) + I{QY^-^ ^ X^). 
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Fig. 4. Directed Information setup under Channel Coding with feedback 



DTI: I{X^ ^Y^) ^J2ZiHX';Y,\Y'-^). 

Just like the case where we maximize I{X; Y) in channel 
coding, or minimize I{X;Y) for source coding in cases 
without feedback, we maximize/minimize I{X — > Y) in the 
corresponding cases with feedback/feedforward. This feature 
selection problem for the i*^ training instance becomes one 
of identifying which hexamer (fc g (1,2,..., 4096)) has the 
highest I{Xi^}. Y^) or the lowest /(Kj Xi ^) if looked 
from the source or channel coding perspectives respectively. 

In our case, we treat the hexamers (Xi) like messages 
and the class labels (Y) as the reconstruction after transmis- 
sion and are interested in asking which messages transmit 
maximum information to the reconstructed versions. These 
are the features which maximally influence the class label. 
At each instant (in time or training iteration), the learning 
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algorithm uses the previous associations between the hexamer 
features (X^ i, 2, • • ■ , Xk,i-i) and corresponding class la- 
bels {Yi,Y2, ■ ■ ■ ,Yi-i) to predict a class label (Yi) for the 
current hexamer feature instance (Xi). Following the concept 
of feedforward, the true class label Yi is then given to the 
algorithm (after the prediction Yi) is made. Based on the 
prediction (Yi) and the true class label Yi, the algorithm will 
learn the hexamer-label associations up until the instance it 
has just observed (i.e. the i*'' instant), and so on. It does 
this for each hexamer 'k' in the feature vector, and finds 
the hexamer which has maximum association within this 
sequential learning paradigm (Figs |3] and lU. Each dotted box 
represents the DTI based feature learning for each hexamer. 
Thus, this is thought of M — 1000 operations in parallel. 
The training iterations proceed as long as it takes to achieve a 
certain accuracy of classification. We would expect that DTI 
needs lesser number of features than MI to achieve the same 
classification rate. 

The DTI (for a lag of 1) - which is a measure of the 
causal dependence between two random processes Xi ~ 
[Xi^i,X2,i, ■ ■ ■ Xn.i]^ (with Xj^i = quantile label for the fre- 
quency of hexamer i e (1,2,..., 1000) in the j*^ training 
sequence) and Y ~ [Yi, I2, • ■ • , Yn] being the corresponding 
class labels (— 1/ + 1), is given by [6]: 



N 



(1) 



I{X 



N 



Y 



N 

E 

n=l 



ij(x"|y")] 



N 



(using /(X";y") = iJ(X") - iJ(X"|r")) 

N 

= XI [/(X"; r") - /(X"; r"-i)] 

n=l 



N 

E 

n=l 



[/(X";y")-/(X";Oy"-i)] 



Here, N is 2Ntrain, the total number of training data class 
labels. As already known, the mutual information I{X\ Y) = 
H{X) - H{X\Y), with H{X) and H{X\Y) being the Shan- 
non entropy of X and the conditional entropy of X given Y, 
respectively. Using this definition of mutual information, the 
Directed Information simplifies to, 

N 

I{X^ Y^) = X[iJ(X"|F"-i) - i?(X"|y'")] 

ri=l 

N 

= ^{[ff (X", r"-i) - ij(r"-i)] - [H{x'\ y") - iJ(r")]} 

n=l 

(2) 

Using (2), the Directed information is expressed in terms of 
individual and joint entropies of X and Y. For the purpose of 
entropy estimation, we can adopt several approaches. In this 
work we examine two different ways to estimate the directed 
information from the hexamer-sequence frequency matrix. 

The first way is to bin each frequency vector into L 
quantiles. Thus, within the i*'* sequence (promoter/LRE), we 
can find the distribution of the hexamers within the sequence 
and bin them into the appropriate quantile. The value/entry in 
each cell of Table I is Z e (1, 2, . . . , L). The last element in 
the data vector is the class label (— 1/ + 1). Hence, it is now 
straightforward to find the marginal and joint distributions for 
the A;*'' hexamer (Xi^k) and the class label (Yi). 

An alternate method to find the joint information of the 
random variables X^ and Y^ uses the Darbellay-Vajda 
algorithm [7]. From (2), we also have. 



In the above expressions, the mutual information be- 
tween two random variables in the sum (/(X"; F") and 
I{X"; OF"^^)) can be estimated using a non-parametric adap- 
tive binning procedure ([35], [7]) by iterative partitioning 
of the observation space until conditional independence is 
achieved within and between partitions. This method lends 
itself to a tree based partitioning scheme and can be used for 
entropy estimation even for a moderate number of samples in 
the observation space of the underlying probability distribu- 
tion. Several such algorithms for adaptive density estimation 
have been proposed ([32], [31], [33], [34]) and can find po- 
tential application in this procedure. Because of the higher 
performance guarantees in using this procedure as well as 
the relative ease of implementation, we use the Darbellay- 
Vajda approach for entropy (and information) estimation in 
our methodology. 

Both these methods (equiquantization and Darbellay-Vajda) 
provide a way to estimate the true DTI between a given 
hexamer and the class label for the entire training set. Feature 
selection comprises of finding all those hexamers (Xi) for 
which I{Xi Y) is the highest. From the definition of 
DTI, we know that < I{X^ Y) < I{Xf,Y) < 00. To 
make a meaningful comparison of the strengths of association 
between different hexamers and the class label, we need to 
find a normalized score (Sec:VII) to rank the DTI values. 
Another point of consideration is that we need to ask how 
significant the DTI value is compared to a null distribution 
on the DTI value (i.e. what is the chance of finding the DTI 
value by chance from the series Xi and Y). This is done using 
confidence intervals after permutation testing (Sec: IX). 

A NORMALIZED DTI MEASURE 

In this section, we derive an expression for a 'normalized 
DTI coefficient'. This is useful for a meaningful comparison 
across different criteria during network inference. For now, 
we will compare the network influences as inferred from 
normalized DTI and CoD [14]. In this section, we use X,Y, Z 
for X^, Y^ and interchangeably, i.e X = X^ , Y = Y^ , 
and Z = Z^. 

By the definition of DTI, we can see that < I{X^ 
Y^) < I{X^ [Y^) < 00. The normalized measure pdti 
should be able to map this large range ([0, 00]) to [0, 1]. 
We recall that the multivariate canonical correlation is given 



by [26]: Px"-Y^ 



.-1/2^ 



.-1/2 
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and this is normalized having eigenvalues between and 1 . We 
also recall that, under a Gaussian distribution on and , 
the joint entropy H{X^;Y^) = -i ln(27re)2^|E^iv^iv |, 
where \A\ is the determinant of matrix A, E denotes the 
covariance matrix. 

Thus, for I{X^]Y^) = H{X^)+H{Y^)-H{X^ -Y^), 
the expression for mutual information, under jointly Gaussian 
assumptions on X^ and Y^ , becomes, 

I{X-Y) = -iln( |g--j^;| ) = -iln(l-p^„^^„). 
Hence, a straighlforward tra nsformation is normalized MI, 
PMi = Vl - e^-2/(^;y) = vT~7^2eCT3?^^WF^ . A 
connection with [27], can thus be immediately seen. 

Both of these will be normalized between (0, 1) and will 
give a better absolute definition of dependency that does not 
depend on the unconditioned MI. We will use this definition 
of normaUzed information coefficients for the present set of 
simulation studies. 

For constructing a normalized version of the DTI, we 
can extend this approach , from ([13], [27]). Consider three 
random vectors X, Y and Z, each of which are identically 
distributed as A/'(/Ux, Ejc jc), A/'(/LtY, Eyy), and M{nz, ^zz) 
respectively. We also have. 



IJ-x ^ 




( Ejsfx 




^XZ 










'^YZ 


A^z / 






'^ZY 


'^ZZ 



(X,Y,Z) ~ 
Their partial correlation 8yx\z is given by, 



a\ = Y^yy — ^YZ^zz^zY, 0-2 = — "^yz^zz 

= EjjfX — ^XZ^^z^ZX 



^^zx, 



Recalling results from conditional Gaussian distributions, 
these can be denoted by: ai = Ey|z, 02 = Yxy\z and o-s = 



^x\z- 



X /2 1/2 

Tjxy\zYx\7. ■ Extending the above 



Thus, 5yx\z — ^Y\z ^^"^\z^x\z 
result from the mutual inf ormation to the directed inf ormation 
case, we have, pdti = "v/l — e^^ ^Dili 

To once again clarify, we recall the primary difference 
between MI and DTI, (note the superscript on X) 
MI: = E^=l'^(^^;^il^*"^) = ^(^^ ^ 

y^) + /(oy^-i->x^). 

DTI: 



ri-\\ 



Kernel Density Estimation (KDE) 

The goal in density estimation is to find a probability density 
function f{z) that approximates the underlying density f{z) 
of the random variable Z. Under certain regularity conditions, 
the kernel density estimator at the point x is given 

by A(Z) = with n being the number 

of samples zi , Z2 , . . . , z„ from which the density is to be 
estimated, h is the bandwidth of a kernel K{*) that is used 
during density estimation. 

A Kernel density estimator at z works by weighting the 
samples (in (zi, 2:2, • • • ) -Zn)) around 2: by a kernel function 
(window) and counts the relative frequency of the weighted 
samples within the window width. As is clear from such 



a framework, the choice of kernel function and the 

bandwidth h determines the fit of the density estimate. 

Some figures of merit to evaluate various kernels are the 
asymptotic mean integrated squared error (AMISE), bias- 
variance characteristics and region of support [29]. It is pre- 
ferred that a kernel have a finite range of support, low AMISE 
and a favorable bias-variance tradeoff. The bias is reduced 
if the kernel bandwidth (region of support) is small, but has 
higher variance because of a small sample size. For a larger 
bandwidth, this is reversed (ie large bias and smaller variance). 
Under these requirements, the Epanechnikov kernel has the 
most of these desirable characteristics - i.e. a compact region 
of support, the lowest AMISE compared to other kernels, and 
a favorable bias variance tradeoff [29]. 

The Epanechnikov kernel is given by: 

K{u) = \{\-v?)I{\u\<\). 

with /(•) being the indicator function conveying a window of 
width spanning [—1,1] centered at 0. An optimal choice of 
the bandwidth is /i = 1.06 x ct^ x n^^l^\, following [[28]]. 
Here dz is the standard error from the bootstrap DTI samples 

(^1, 22, . . . , Zn). 

Hence the kernel density estimate for the bootstrapped DTI 
(with n ^ 1000 samples), Z = Ib{X^ Y^) becomes, 

hiz) = ;^Er=i![i-(^mi 

2.67(Tr and n 



< 1) with h 



1000. We note that Ib{X 



N 



Y 



N\ 



IS 

obtained by finding the DTI for each random permutation of 
X, Y time series, and performing this permutation B times. 

VII. Bootstrapped Confidence Intervals 

Since we do not know the true distribution of the DTI 
estimate, we find an approximate confidence interval for the 
DTI estimate (I{X^ Y^)), using bootstrap above [30]. 

We denote the cumulative distribution function 
(over the Bootstrap samples) of I{X^ — > Y^) by 
^'/s(x"^Y")(-fs(^^ ^ '^^))- Let the mean of the 
bootstrapped null distribution be /^(X^ Y^). We denote 
by ti-a, the (1 — a)*^ quantile of this distribution i.e. 



{ti-a : P{[- 



Since we need the real I{X^ Y^) to be significant 
and as close to 1, we need i{X^ Y^) > [/^(X^ 

—a X with IT being the standard error of the 
bootstr apped distribution, 

(T = Y ^ ''-^ ^ —; B IS the number of 

Bootstrap samples. 

VIII. Support Vector Machines 

From the top "d" features identified from the ranked Ust 
of features having high DTI with the class label, a hyper- 
plane linear classifier in these "d" dimensions is designed. 
A Support Vector Machine (SVM) is a hyperplane classi- 
fier which operates by finding a maximum margin linear 
hyperplane to separate two different classes of data in high 
dimensional {D > "d") space. Our training data has Ntrain 
pairs {xi,yi), (0:2,^2), • • • , {xNtrain^VNtrain)^ wlth Xi e 
andyi e {-1,+1}. 



g 



An SVM is a maximum margin hyperplane classifier in a 
non-linearly extended high dimensional space. For extending 
the dimensions from d to D > d, a radial basis kernel is used. 

The objective is to minimize in the hyperplane {x : 

f{x) = x'^P + /3o}, subject to 

y^ixj(3 + /3o) > 1 - CzVi, C» > 0, E 6 < constant [29]. 

IX. Summary of Overall Approach 

Our proposed approach is as follows. Here, the term 'se- 
quence' can pertain to either tissue specific promoters or LRE 
sequences. 

• Parse the sequence to obtain the relative 
counts/frequencies of occurrence of the hexamer in 
that sequence and build the hexamer-sequence frequency 
matrix. The 'seqinr' package in R is used for this purpose. 
This is done for all the sequences in the specific (class 
+ 1) and non-specific (class —1) categories. The matrix 
thus has Ntrain TOWS and 4^ = 4096 columns. 

• Preprocess the obtained hexamer-sequence frequency ma- 
trix by finding the quantile labels for each hexamer within 
the j*'* sequence. We now have a hexamer-sequence 
matrix where the cell has the quantile label of the 
J*'* hexamer in the i*'* sequence. This is done for all 
the N{— Sts + Snts) training sequences consisting of 
examples from the —1 and +1 class labels. 

• Build two submatrices corresponding to the two class la- 
bels. Thus one matrix will contain the hexamer-sequence 
quantile labels for the positive training examples and the 
other matrix is for the negative training examples. 

• To pick the hexamers that are most different between the 
positive and negative training examples, we perform a 
paired t-test for each hexamer. Rank all the corresponding 
t-test p-values from lowest to highest and take the top 
1000 hexamers. These correspond to the 1000 hexamers 
that are most different distributionally (in mean) between 
the positive and negative training samples. We note that 
the t-test requires the same number of samples in the 
positive (Sts samples) and negative [Snts samples) train- 
ing set. Hence, we consider Ntrain — rnin{Sts, Snts) 
examples for each of the positive and negative training 
cases. Another way to resolve this problem is to find the 
symmetrized KL divergence/ Jensen-Shannon divergence 
between the hexamer distributions of the positive and 
negative examples. This step is only necessary to reduce 
the computational complexity of the overall procedure - 
finding the DTI between each of the 4096 hexamers and 
the class label is very expensive. 

• For the top K — 1000 hexamers which are most 
significantly different between the positive and negative 
training examples, we proceed to find /(Xj ^ Yi) 
and I{Xi,k Yi) for each of the k € (1,2,..., if) 
hexamers. The entropy terms in the directed information 
and mutual information expressions can be found either 
from the equidistant binning approach or the Darbellay- 
Vajda approach. Since the goal is to maximize I{Xi^k 
Yi) or minimize I{Yi ^i,fc), we can rank them in 
descending and ascending order, respectively. Using the 



procedure of Section. VII, the raw DTI values can be 
converted into their normalized versions. 

> We also find the significance of the DTI estimate obtained 
in the step above. Thus if we set a threshold of 0.05 
significance, we can take every hexamer whose DTI 
is 0.05 significant with respect to its bootstrapped null 
distribution (using kernel density estimation), and rank 
the hexamers by decreasing DTI value. The top "d" 
hexamers in this ranked list can be used for classifier 
(SVM) training. 

• We now train the Support Vector Machine classifier 
(SVM) on the top "d" features from the ranked DTI 
list(s). For comparison with the MI based technique, we 
use the hexamers which have the top "d" (normalized) 
MI values. We can now plot the accuracy of the trained 
classifier as a function of the number of features (d). As 
we gradually consider higher "d", we move down the 
ranked list. 

X. Results 

A. Tissue specific promoters 

We use DTI to find discriminating hexamers that underlie 
brain specific and heart specific expression. 

Results for the MI and DTI methods are given below (Figs|5] 
and|6l). The plots indicate the cross-validated misclassification 
accuracy (ideally 0) for the data as the number of features 
using the metric (DTI or MI) is gradually increased. We can 
see that for any given classification accuracy, the number of 
features using DTI is less than the corresponding number of 
features using MI. This translates into a lower misclassification 
rate for DTI-based feature selection. We observe that as the 
number of features " d" is increased the performance of MI is 
the same as DTI. 




Fig. 5. Misclassification accuracy for the MI vs. DTI case (brain promoter 
set) 



Some of the key motifs in the heart and brain case are 
given in Table II. Wherever possible, we indicate if the motif 
corresponds to a known transcription factor binding motif. We 



9 



U 0.15 - 



-i^ Ml 
* DTI 



20 40 60 



80 100 120 140 160 180 200 

Number of features — > 




20 40 60 



80 100 120 140 160 180 200 

Number of features — > 



Fig. 6. Misclassification accuracy for the MI vs. DTI case (heai1 promoter Fig. 7. Misclassification accuracy for the MI vs. DTI case (brain enhancer 



set) 



set) 



note that just because a motif corresponds to a transcription 
factor binding site (TFBS), it does not imply that the TF is 
functional in brain or heart. It might however, be useful to do 
focused experiments to check their functional role. 



Brain 
-promoters 



Heart 
promoters 



Brain 
enhancers 



Ahr-ARNT Pax2 HNF-4 

Tcfll-MafG Tcfll-MafG Nkx 

c-ETS XBPl AMLl 

FREAC-4 Sox- 17 c-ETS 

T3R-alphal FREAC-4 Elkl 

TABLE II 

Comparison of moH ranking motifs (by DTI) across different 

DATA SETS. 



B. Enhancer DB 

We examine all the brain -specific regulatory 
elements profiled in the EnhancerDB database 
{http://http://enhancer.lbl.go\'/) for discriminating motifs. 
Again, the plot of misclassification accuracy vs. number of 
features in the MI and DTI scenarios are indicated in Figs. 

EH 

Some of the top ranking motifs from this dataset are also 
shown in Table II. 

XI. Other Applications 

We now proceed to show other related applications wherein 
the DTI based learning framework is useful. Compared to 
other approaches we can investigate the role of 'any' motif 
(possibly a transcription factor binding motif) both in sequence 
or via expression data. We illustrate this via an example. 

A. 

Suppose we are interested in the transcription factors that 
regulate GataS gene expression. This gene has expression in 



the developing kidney, central nervous systems and hematopoi- 
etic cell differentiation. In concordance with the established 
framework of transcription presented in Section I, recruitment 
of TFs happens at both the proximal promoter as well as 
long-range regulatory elements. A common approach to find 
functional TFBSes from the promoter sequence is to look for 
phylogenetically conserved TF binding sites in the promoter 
sequence among species in which Gata3 is involved in the 
same biological process (here kidney development). 




Fig. 8. Cumulative Distribution Function for bootstrapped I(Pax2 GataS). 
True i{Pax2 GataS) = 0.9911. 

As shown above, this DTI is seen to be significant and 
strong. DTI also enables the integration of microarray time 
series data expression (from the developing kidney) of the 
Pax2 gene to ask if there is any influence from Pax2 to 
Gata3. This is not discussed here but some preliminary work 
is available in [3]. 

B. 

The other question again picks up on something quite 
traditionally done in bioinformatics research - finding key TF 
regulators underlying tissue-specific expression. Again, using 
the Gata3 gene as an example, it has been observed that 
it is expressed in the developing ureteric bud (UB) during 
kidney development. To find UB specific regulators, we look 
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Fig. 9. Cumulative Distribution Function for bootstrapped 

I{Pax2 motif :GTTCC y); V is the class label (UB/non-UB). True 
I{Pax2 GataS) = 0.9792. 

for conserved TF modules in the promoters of UB-specific 
genes. These experimentally annotated UB-specific genes are 
obtained from the Mouse Genome Informatics database at 
http://www.informatics.jax.org/. Several programs are used for 
this kind of analysis, like Genomatix [25] or Toucan [24]. 
Using Toucan, we align the promoters of the various UB 
specific genes, and obtained several related modules. The top- 
ranking module in Toucan contains AHR-ARNT, Hoxl3, Pax2, 
Tallalpha-E47, Octl. We can now check if the corresponding 
motifs can discriminate UB-specific and non-specific genes, 
from DTI. 

We can now check if the Pax2 binding motif (GTTCC 
[18]) induces kidney specific expression by looking for the 
strength of DTI between the GTTCC motif and the class 
label (+1) indicating UB expression. This once again adds to 
computational evidence for the true role of Pax2 in directing 
ureteric bud specific expression [18]. 

XII. Discussion 

From the results above, we observe that the average misclas- 
sification error is higher in the heart/brain promoter datasets 
than for the enhancer data. We speculate this is due to a 
number of reasons: 

• There is more sequence variability at the promoter since 
it has to act in concert with LREs of different tissue types. 

• Since the enhancer/LRE acts with the promoter to confer 
expression in only one tissue type, these sequences are 
more specific and hence their mining identifies motifs that 
are probably more indicative of tissue-specific expression. 

We however, reiterate that the enhancer dataset that we study 
always make use of the hsp68-lacz as the promoter driven by 
the ultraconserved elements. Hence we do not have promoter 
specificity. Though this is a disadvantage and might not reveal 
all key motifs, it is the best that can be done in the absence 
of any other comprehensive repository. 

XIII. Conclusions 

In this work, we have presented a framework for the 
identification of hexamer motifs to discriminate between two 
kinds of sequences (tissue-specific promoters vs non-specific 
or tissue-specific regulatory elements vs non-specific). For this 
feature selection problem we proposed the utility of a new 
metric - the 'directed information' (DTI). In conjunction with 



a support vector machine classifier, this method was shown to 
outperform the state of the art methods employing ordinary 
mutual information. We also find that only a subset of the 
discriminating motifs correlate with known transcription factor 
motifs and hence might be potentially related to underlying 
epigenetic phenomena governing tissue-specific expression. 
The superior performance of the directed-information based 
variable selection suggests its utility to more general sequential 
learning problems. 

We also examine the applicability of DTI for questions 
focusing on any motif of interest, obtained as output from other 
sources (literature, expression data, module searches). Thus 
one can prospectively resolve the role of a TF in a biological 
process. 
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